Large Vocabulary Continuous Speech Recognition for Estonian Using Morpheme Classes
Abstract
This paper describes the development of a large-vocabulary, speaker-independent continuous speech recognition system for Estonian. Estonian is an agglutinative language with a very large number of distinct word forms, and its word order is relatively unconstrained. To achieve good language coverage, we use pseudo-morphemes as the basic units in a statistical trigram language model. To improve language model robustness, we automatically find morpheme classes and interpolate the morpheme model with the class-based model. The language model is trained on a newspaper corpus of 15 million word forms. Clustered triphones with multiple Gaussian mixture components are used for acoustic modeling. The system with the interpolated morpheme language model is found to perform significantly better than the baseline word form trigram system in all areas. The word error rate of the best system is 27.3%, which is a 10.0% absolute improvement over the baseline system.
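The interpolation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes simple linear interpolation with a single weight and the usual class-based factorization P(c_i | c_{i-2}, c_{i-1}) · P(m_i | c_i); the abstract does not state which interpolation scheme or weight the authors used. The function names, the weight `lam`, and all numbers are hypothetical.

```python
def class_trigram_prob(p_class_given_history: float, p_unit_given_class: float) -> float:
    """Class-based estimate for a morpheme m_i with class c_i:
    P(c_i | c_{i-2}, c_{i-1}) * P(m_i | c_i)."""
    return p_class_given_history * p_unit_given_class


def interpolate(p_morpheme: float, p_class: float, lam: float) -> float:
    """Linearly interpolate two probability estimates for the same morpheme.

    lam is the weight given to the morpheme trigram model (assumed, not
    taken from the paper); the class-based model receives 1 - lam.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_morpheme + (1.0 - lam) * p_class


# Made-up numbers for illustration only.
p_morph = 0.0                           # morpheme trigram unseen in training
p_cls = class_trigram_prob(0.2, 0.05)   # class model still gives some mass
p_mix = interpolate(p_morph, p_cls, lam=0.7)
print(p_mix)
```

In this toy case the morpheme trigram alone assigns zero probability, while the interpolated estimate stays positive. That is the kind of robustness gain the class-based component is meant to provide.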
Related resources
Large Vocabulary Continuous Speech Recognition for Estonian Using Morphemes and Classes
This paper describes development of a large vocabulary continuous speaker independent speech recognition system for Estonian. Estonian is an agglutinative language and the number of different word forms is very large, in addition, the word order is relatively unconstrained. To achieve a good language coverage, we use pseudo-morphemes as basic units in a statistical trigram language model. To im...
Lemmatized Latent Semantic Model for Language Model Adaptation of Highly Inflected Languages
We present a method to adapt statistical N-gram models for large vocabulary continuous speech recognition of highly inflected languages. The method combines morphological analysis, latent semantic analysis (LSA) and fast marginal adaptation for building topic-adapted trigram models, based on a background language model and very short adaptation texts. We compare words, lemmas and morphemes as b...
Korean large vocabulary continuous speech recognition using pseudomorpheme units
This paper presents a Korean large vocabulary continuous speech recognition system based on pseudomorpheme units. In Korean, an eojeol (word phrase) is a unit for spacing and a morpheme is the smallest unit with semantic meaning. If the eojeol is used as the dictionary and language modeling unit, the number of the unit becomes enormous. Instead we propose to use modified morpheme or pseudomorph...
Limited-Vocabulary Estonian Continuous Speech Recognition System using Hidden Markov Models
The article presents a limited-vocabulary speaker independent continuous Estonian speech recognition system based on hidden Markov models. The system is trained using an annotated Estonian speech database of 60 speakers, approximately 4 hours in duration. Words are modelled using clustered triphones with multiple Gaussian mixture components. The system is evaluated using a number recognition ta...
On large vocabulary continuous speech recognition of highly inflectional language - Czech
A system for large vocabulary continuous speech recognition of highly inflectional language is introduced. Word-based recognition approach is compared with a morpheme-based recognition system. An experiment involving Czech N-best rescoring has been performed with encouraging results.